Skip to content

Conversation

abdelrahman882
Copy link
Contributor

@abdelrahman882 abdelrahman882 commented Sep 29, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Proactive scale up is listing the controllers, check the desired replicas and inject fake pods to proactively scale up even before these pods are created and marked as unschedulable.

Proactive scale up is using the following formula for the number of fake pods to inject per controller:

Number of fake pods = number of controller desired pods - (scheduled pods + unschedualable pods + unprocessed pods)

The problem here is that scheduling gated pods are not being excluded along with unschedualable and unprocessed so proactive scale up injects fake pods for these gated pods ignoring the condition. that happens each loop which leads to not scaling down that empty space.

This PR subtracts the number of pods with scheduling gates from the number of fake pods to inject.

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

  • ctx.AllPodLister().List() should be using a lister, so we are not having extra api call here to list the pods rather we just get it from cache.
  • In PodInjectionPodListProcessor in case of any error filtering out scheduling gated pods, I choose to only log a warning and not return an error so it doesn't block the rest of the processors as CombinedPodListProcessor stops the processing in case any one returns an error, which means any following processor (all others) will not be executed

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler labels Sep 29, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abdelrahman882
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 29, 2025
@abdelrahman882 abdelrahman882 force-pushed the proactive-scaleup-schgates branch from 286ba74 to ff346b2 Compare September 29, 2025 09:29
@abdelrahman882 abdelrahman882 force-pushed the proactive-scaleup-schgates branch from ff346b2 to 4627816 Compare October 1, 2025 10:47
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 1, 2025
@abdelrahman882 abdelrahman882 force-pushed the proactive-scaleup-schgates branch from 4627816 to 030ad2c Compare October 2, 2025 11:44
@abdelrahman882 abdelrahman882 requested a review from x13n October 2, 2025 12:04
@abdelrahman882 abdelrahman882 force-pushed the proactive-scaleup-schgates branch from 030ad2c to a281a40 Compare October 3, 2025 13:32
@abdelrahman882 abdelrahman882 requested a review from x13n October 3, 2025 13:58
for _, podOwnerRef := range pod.OwnerReferences {
// SchedulingGated pods can't be unschedualable nor unprocessed nor scheduled so it is not expected
// to have them as group sample nor in pod count, so decreasing desiredReplicas by one is enough
if grp, found := groups[podOwnerRef.UID]; found {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only subtract when podOwnerRef.Controller != nil && *podOwnerRef.Controller to be in sync with updatePodGroups below. But actually, instead of adding replicas and then subtracting them here, wouldn't it suffice to pass scheduling gated pods to groupPods? We're already passing a combined list of scheduled and unschedulable pods there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr passed scheduling gated pods to groupPods with pods, they would be added to pod count and we wont inject fake pods for them

This should only subtract when podOwnerRef.Controller != nil && *podOwnerRef.Controller to be in sync with updatePodGroups below

Not really, I believe it would be redundant check because if the UID exists in the groups map then it refers to the same object that is pre checked to be controller, but will add as double check

But actually, instead of adding replicas and then subtracting them here, wouldn't it suffice to pass scheduling gated pods to groupPods? We're already passing a combined list of scheduled and unschedulable pods there.

We are not adding replicas and then subtracting them, we add twice to count and subtract from desired, The only reason I am doing that after grouping is to not do the listing and filtering to get gated pods in case we don't have any groups.

@abdelrahman882 abdelrahman882 force-pushed the proactive-scaleup-schgates branch from a281a40 to 341d253 Compare October 6, 2025 11:54
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 6, 2025
@abdelrahman882 abdelrahman882 requested a review from x13n October 6, 2025 20:55
Copy link
Member

@x13n x13n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just one last comment.

var podsToInject []*apiv1.Pod
allPods, err := ctx.AllPodLister().List()
if err != nil {
klog.Errorf("Failed to list all pods from all pod lister: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant with the error returned - we'll end up logging the error twice. I'd just return the error and let the caller do the logging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants